1 Concepts and effectiveness of the cover-coefficient-based clustering methodology for
text databases
Fazli Can, Esen A. Ozkarahan 

December 1990 ACM Transactions on Database Systems (TODS), Volume 15 Issue 4

Full text available: pdf(2.74 MB)  Additional Information: full citation, abstract, references, citings, index terms, review



A new algorithm for document clustering is introduced. The base concept of the algorithm, 
the cover coefficient (CC) concept, provides a means of estimating the number of clusters 
within a document database and related indexing and clustering analytically. The CC 
concept is used also to identify the cluster seeds and to form clusters with these seeds. It is 
shown that the complexity of the clustering process is very low. The retrieval experiments 
show that the information-retrieval effectiv ... 

Keywords: cluster validity, clustering-indexing relationships, cover coefficient, decoupling 
coefficient, document retrieval, retrieval effectiveness 



2 A framework for effective retrieval

C. T. Yu, W. Meng, S. Park

June 1989 ACM Transactions on Database Systems (TODS), Volume 14 Issue 2 

Full text available: pdf(1.56 MB)  Additional Information: full citation, abstract, references, citings, index terms, review

The aim of an effective retrieval system is to yield high recall and precision (retrieval 
effectiveness). The nonbinary independence model, which takes into consideration the 
number of occurrences of terms in documents, is introduced. It is shown to be optimal 
under the assumption that terms are independent. It is verified by experiments to yield 
significant improvement over the binary independence model. The nonbinary model is 
extended to normalized vectors and is applicable to more genera ... 

3 Concepts of the cover coefficient-based clustering methodology 
Fazli Can, Esen A. Ozkarahan 

June 1985 Proceedings of the 8th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf  Additional Information: full citation, abstract, references, citings



Document clustering has several unresolved problems. Among them are high time and 
space complexity, difficulty of determining similarity thresholds, order dependence,
nonuniform document distribution in clusters, and arbitrariness in determination of various
cluster initiators. To overcome these problems to some degree, the cover coefficient based
clustering methodology has been introduced. The concepts used in this methodology have 
created certain new concepts, relationships, and me ... 

4 Special issue on word sense disambiguation: Introduction to the special issue on word
sense disambiguation: the state of the art

Nancy Ide, Jean Veronis 

March 1998 Computational Linguistics, Volume 24 Issue 1

Full text available: pdf  Additional Information: full citation, references, citings
Publisher Site



5 Deriving concept hierarchies from text

Mark Sanderson, Bruce Croft 

August 1999 Proceedings of the 22nd annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(100.05 KB)  Additional Information: full citation, references, citings, index terms



Keywords: concept hierarchy, multi-document summary, subsumption, term co-occurrence



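6 Poster papers: CVS: a Correlation-Verification based Smoothing technique on information retrieval and term clustering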

Christina Yip Chung, Bin Chen 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf  Additional Information: full citation, abstract, references, index terms

As information volume in enterprise systems and in the Web grows rapidly, how to 
accurately retrieve information is an important research area. Several corpus based 
smoothing techniques have been proposed to address the data sparsity and synonym 
problems faced by information retrieval systems. Such smoothing techniques are often 
unable to discover and utilize the correlations among terms. We propose CVS, a Correlation- 
Verification based Smoothing method, that considers co-occurrence information i ... 

Keywords: information retrieval, query expansion, smoothing, term clustering, text mining 



7 Term clustering of syntactic phrases

D. D. Lewis, W. B. Croft 

December 1989 Proceedings of the 13th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(1.62 MB)  Additional Information: full citation, abstract, references, citings, index terms

Term clustering and syntactic phrase formation are methods for transforming natural 
language text. Both have had only mixed success as strategies for improving the quality of 
text representations for document retrieval. Since the strengths of these methods are 
complementary, we have explored combining them to produce superior representations. In 
this paper we discuss our implementation of a syntactic phrase generator, as well as our
preliminary experiments with producing phrase clusters. Th ... 

8 Information ... using concept hierarchies for mobile clients
D. L. Chan, R. W. P. Luk, W. K. Mak, H. V. Leong, E. K. S. Ho, Q. Lu 
March 2002 Proceedings of the 2002 ACM symposium on Applied computing 

Full text available: pdf  Additional Information: full citation, abstract, references, index terms

Mobile clients have limited display and navigation capabilities. To browse a set of 
documents, an intuitive method is to navigate through concept hierarchies. To reduce 
semantic loading for each term that represents the concepts and the cognitive loading of 
users due to the limited display, similar documents are grouped together before concept 
hierarchies are constructed for each document group. Since the concept hierarchies only 
represent the salient concepts in the documents, term extraction i ... 

Keywords: browsing, concept hierarchy, information access, mobile agent, mobile 
computing, navigation, summarization 



9 Papers: Concept clustering and knowledge integration from a children's dictionary




Caroline Barriere, Fred Popowich 
August 1996 Proceedings of the 16th conference on Computational linguistics - Volume 
1 

Full text available: pdf  Additional Information: full citation, abstract, references

Knowledge structures called Concept Clustering Knowledge Graphs (CCKGs) are introduced
along with a process for their construction from a machine readable dictionary. CCKGs 
contain multiple concepts interrelated through multiple semantic relations together forming 
a semantic cluster represented by a conceptual graph. The knowledge acquisition is 
performed on a children's first dictionary. The concepts involved are general and typical of a 
daily life conversation. A collection of conceptual cluste ... 



10 The concept of dynamic analysis
Thomas Ball

October 1999 ACM SIGSOFT Software Engineering Notes, Proceedings of the 7th
European software engineering conference held jointly with the 7th ACM
SIGSOFT international symposium on Foundations of software
engineering, Volume 24 Issue 6

Full text available: pdf(1.37 MB)  Additional Information: full citation, abstract, references, citings, index terms

Dynamic analysis is the analysis of the properties of a running program. In this paper, we 
explore two new dynamic analyses based on program profiling: Frequency Spectrum
Analysis. We show how analyzing the frequencies of program entities in a single execution 
can help programmers to decompose a program, identify related computations, and find 
computations related to specific input and output characteristics of a program. Cover ... 

11 Fast detection of communication patterns in distributed executions
Thomas Kunz, Michiel F. H. Seuren 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced Studies 
on Collaborative research 

Full text available: pdf(4.21 MB)  Additional Information: full citation, abstract, references, index terms

Understanding distributed applications is a tedious and difficult task. Visualizations based on 
process-time diagrams are often used to obtain a better understanding of the execution of 
the application. The visualization tool we use is Poet, an event tracer developed at the 
University of Waterloo. However, these diagrams are often very complex and do not provide
the user with the desired overview of the application. In our experience, such tools display 
repeated occurrences of non-trivial commun ... 

12 Semantic indexing for a complete subject discipline
Yi-Ming Chung, Qin He, Kevin Powell, Bruce Schatz 

August 1999 Proceedings of the fourth ACM conference on Digital libraries 

Full text available: pdf  Additional Information: full citation, references, index terms



Keywords: MEDLINE, MEDSPACE, concept space, interspace, medical informatics, scalable 
semantics, semantic indexing, semantic retrieval 



13 Web clustering: Inferring hierarchical descriptions
Eric Glover, David M. Pennock, Steve Lawrence, Robert Krovetz 

November 2002 Proceedings of the eleventh international conference on Information 
and knowledge management 

Full text available: pdf(239.32 KB)  Additional Information: full citation, abstract, references, citings, index terms

We create a statistical model for inferring hierarchical term relationships about a topic, 
given only a small set of example web pages on the topic, without prior knowledge of any 
hierarchical information. The model can utilize either the full text of the pages in the cluster 
or the context of links to the pages. To support the model, we use "ground truth" data 
taken from the category labels in the Open Directory. We show that the model accurately 
separates terms in the following classes: self ...

Keywords: cluster naming, feature selection, hierarchical relationships, statistical models, 
web analysis 



14 Research track posters: Cluster-based concept invention for statistical relational
learning

Alexandrin Popescul, Lyle H. Ungar 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf  Additional Information: full citation, abstract, references, index terms

We use clustering to derive new relations which augment database schema used in 
automatic generation of predictive features in statistical relational learning. Entities derived 
from clusters increase the expressivity of feature spaces by creating new first-class 
concepts which contribute to the creation of new features. For example, in CiteSeer, papers 
can be clustered based on words or citations giving "topics", and authors can be clustered 
based on documents they co-author giving "communities" ... 

Keywords: clustering, feature generation, relational learning 

15 Using multiple knowledge sources for word sense discrimination
Susan W. McRoy 

March 1992 Computational Linguistics, Volume 18 Issue 1 

Full text available: pdf  Additional Information: full citation, abstract, references, citings
Publisher Site

This paper addresses the problem of how to identify the intended meaning of individual
words in unrestricted texts, without necessarily having access to complete representations 
of sentences. To discriminate senses, an understander can consider a diversity of 
information, including syntactic tags, word frequencies, collocations, semantic context, role- 
related expectations, and syntactic restrictions. However, current approaches make use of 
only small subsets of this information. Here we will des ... 

16 An error-based conceptual clustering method for providing approximate query answers

W. W. Chu, K. Chiang, C. Hsu, H. Yau 
December 1996 Communications of the ACM 

Full text available: pdf(351.76 KB)  Additional Information: full citation, references, citings, index terms



17 On modeling of information retrieval concepts in vector spaces 
S. K. M. Wong, W. Ziarko, V. V. Raghavan, P. C. N. Wong

June 1987 ACM Transactions on Database Systems (TODS), Volume 12 Issue 2

Full text available: pdf(1.80 MB)  Additional Information: full citation, abstract, references, citings, index terms, review

The Vector Space Model (VSM) has been adopted in information retrieval as a means of 
coping with inexact representation of documents and queries, and the resulting difficulties 
in determining the relevance of a document relative to a given query. The major problem in 
employing this approach is that the explicit representation of term vectors is not known a 
priori. Consequently, earlier researchers made the assumption that the vectors 
corresponding to terms are pairwise orthogonal. Such an a ... 

18 Special issue on word sense disambiguation: Automatic word sense discrimination
Hinrich Schutze 

March 1998 Computational Linguistics, Volume 24 Issue 1 

Full text available: pdf  Additional Information: full citation, abstract, references, citings
Publisher Site

This paper presents context-group discrimination, a disambiguation algorithm based on 
clustering. Senses are interpreted as groups (or clusters) of similar contexts of the 
ambiguous word. Words, contexts, and senses are represented in Word Space, a high- 
dimensional, real-valued space in which closeness corresponds to semantic similarity. 
Similarity in Word Space is based on second-order co-occurrence: two tokens (or contexts) 
of the ambiguous word are assigned to the same sense cluster if the wo ... 



19 A survey of Web metrics

Devanshu Dhyani, Wee Keong Ng, Sourav S. Bhowmick 
December 2002 ACM Computing Surveys (CSUR), Volume 34 Issue 4 

Full text available: pdf(289.28 KB)  Additional Information: full citation, abstract, references, citings, index terms

The unabated growth and increasing significance of the World Wide Web has resulted in a 
flurry of research activity to improve its capacity for serving information more effectively. 
But at the heart of these efforts lie implicit assumptions about "quality" and "usefulness" of 
Web resources and services. This observation points towards measurements and models 
that quantify various attributes of web sites. The science of measuring all aspects of 
information, especially its storage and retrieval or ... 

Keywords: Information theoretic, PageRank, Web graph, Web metrics, Web page 
similarity, quality metrics

20 Web searching: WISE-cluster: clustering e-commerce search engines automatically
November 2004 Proceedings of the 6th annual ACM international workshop on Web 
information and data management 

Full text available: pdf  Additional Information: full citation, abstract

21 A parallel algorithm for record clustering

Edward Omiecinski, Peter Scheuermann 

December 1990 ACM Transactions on Database Systems (TODS), Volume 15 Issue 4



Full text available: pdf(1.32 MB)  Additional Information: full citation, abstract, references, citings, index terms, review



We present an efficient heuristic algorithm for record clustering that can run on a SIMD 
machine. We introduce the P-tree, and its associated numbering scheme, which in the split 
phase allows each processor independently to compute the unique cluster number of a 
record satisfying an arbitrary query. We show that by restricting ourselves in the merge 
phase to combining only sibling clusters, we obtain a parallel algorithm whose speedup ratio 
is optimal in the number of processors used. Final ... 

22 Exploiting clustering and phrases for context-based information retrieval
Peter G. Anick, Shivakumar Vaithyanathan 

July 1997 ACM SIGIR Forum, Proceedings of the 20th annual international ACM
SIGIR conference on Research and development in information retrieval,
Volume 31 Issue SI

Full text available: pdf(1.55 MB)  Additional Information: full citation, references, citings, index terms



23 Concept based query expansion 
Yonggang Qiu, Hans-Peter Frei 

July 1993 Proceedings of the 16th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf  Additional Information: full citation, abstract, references, citings, index terms

Query expansion methods have been studied for a long time - with debatable success in 
many instances. In this paper we present a probabilistic query expansion model based on a 
similarity thesaurus which was constructed automatically. A similarity thesaurus reflects 
domain knowledge about the particular collection from which it is constructed. We address 
the two important issues with query expansion: the selection and the weighting of 
additional search terms. In contrast to earlier methods, ... 

24 Building efficient and effective metasearch engines

Weiyi Meng, Clement Yu, King-Lup Liu 
March 2002 ACM Computing Surveys (CSUR), Volume 34 Issue 1

Full text available: pdf(416.07 KB)  Additional Information: full citation, references, citings, index terms



Frequently a user's information needs are stored in the databases of multiple search 
engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search 
engines and identify useful documents from the returned results. To support unified access 
to multiple search engines, a metasearch engine can be constructed. When a metasearch 
engine receives a query from a user, it invokes the underlying search engines to retrieve 
useful information for the user. Metasearch engines have ... 

Keywords: Collection fusion, distributed collection, distributed information retrieval, 
information resource discovery, metasearch 



25 Information retrieval & extraction: N-gram cluster identification during empirical
knowledge representation generation
Robin Collier 

August 1994 Proceedings of the 15th conference on Computational linguistics - Volume 
2 

Full text available: pdf(467.30 KB)  Additional Information: full citation, abstract, references

This paper presents an overview of current research concerning knowledge extraction from 
technical texts. In particular, the use of empirical techniques during the identification and 
generation of a semantic representation is considered. A key step is the discovery of useful 
n-grams and correlations between clusters of these n-grams. 

Keywords: knowledge representation, language understanding, large text corpora 



26 Improving the effectiveness of information retrieval with local context analysis

Jinxi Xu, W. Bruce Croft 

January 2000 ACM Transactions on Information Systems (TOIS), Volume 18 Issue 1 

Full text available: pdf  Additional Information: full citation, abstract, references, citings, index terms, review

Techniques for automatic query expansion have been extensively studied in information 
research as a means of addressing the word mismatch between queries and documents. 
These techniques can be categorized as either global or local. While global techniques rely 
on analysis of a whole collection to discover word relationships, local techniques emphasize 
analysis of the top-ranked documents retrieved for a query. While local techniques have 
shown to be more effective than global techniques in ...

Keywords: cooccurrence, document analysis, feedback, global techniques, information 
retrieval, local context analysis, local techniques 



27 Modeling word occurrences for the compression of concordances
A. Bookstein, S. T. Klein, T. Raita 

July 1997 ACM Transactions on Information Systems (TOIS), Volume 15 Issue 3 

Full text available: pdf(630.99 KB)  Additional Information: full citation, abstract, references, index terms, review

An earlier paper developed a procedure for compressing concordances, assuming that all 
elements occurred independently. The models introduced in that paper are extended here
to take the possibility of clustering into account. The concordance is conceptualized as a set
of bitmaps, in which the bit locations represent documents, and the one-bits represent the
occurrence of given terms. Hidden Markov Models (HMM's) are used to describe the
clustering of the one-bits. However, for computational ... 

Keywords: classification of graph nodes, concordance organization, concordance storage, 
graph structure 



28 Evaluation of an inference network-based retrieval model

Howard Turtle, W. Bruce Croft 

July 1991 ACM Transactions on Information Systems (TOIS), Volume 9 Issue 3 
Full text available: pdf  Additional Information: full citation, references, citings, index terms, review



Keywords: document retrieval, inference networks, network retrieval models 



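29 IR-KM-1 (information retrieval & knowledge management): Dynamic
extraction of topic descriptors and discriminators: towards automatic context-based topic
search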

Ana Maguitman, David Leake, Thomas Reichherzer, Filippo Menczer 
November 2004 Proceedings of the Thirteenth ACM conference on Information and 
knowledge management 

Full text available: pdf  Additional Information: full citation, abstract, references, index terms

Effective knowledge management may require going beyond initial knowledge capture, to 
support decisions about how to extend previously-captured knowledge. Electronic 
concept maps, interlinked with other concept maps and multimedia resources, can
provide rich knowledge models for human knowledge capture and sharing. This
paper presents research on methods for supporting experts as they extend these 
knowledge models, by searching the Web for new context-relevant to ...

Keywords: acquisition tools, automatic topic search, concept mapping, context, 
information retrieval, knowledge, knowledge management 



30 Word sense disambiguation using a second language monolingual corpus
Ido Dagan, Alon Itai 

December 1994 Computational Linguistics, Volume 20 Issue 4 

Full text available: pdf  Additional Information: full citation, abstract, references, citings
Publisher Site

This paper presents a new approach for resolving lexical ambiguities in one language using 
statistical data from a monolingual corpus of another language. This approach exploits the 
differences between mappings of words to senses in different languages. The paper 
concentrates on the problem of target word selection in machine translation, for which the 
approach is directly applicable. The presented algorithm identifies syntactic relations 
between words, using a source language parser, and maps t ... 

31 Automatic abstracting and indexing - survey and recommendations

H. P. Edmundson, R. E. Wyllys 

May 1961 Communications of the ACM, Volume 4 Issue 5 

Full text available: pdf  Additional Information: full citation, abstract, references, citings, index terms

In preparation for the widespread use of automatic scanners which will read documents and
transmit their contents to other machines for analysis, this report presents a new concept in 
automatic analysis: the relative-frequency approach to measuring the significance of words, 
word groups, and sentences. The relative-frequency approach is discussed in detail, as is its 
application to problems of automatic indexing and automatic abstracting. Included in the 
report is a summary of automatic ana ... 

S. Kumar 

October 1968 Journal of the ACM (JACM), Volume 15 Issue 4

Full text available: pdf  Additional Information: full citation, references, citings, index terms



33 HyPursuit: a hierarchical network search engine that exploits content-link hypertext
clustering

Ron Weiss, Bienvenido Velez, Mark A. Sheldon 

March 1996 Proceedings of the seventh ACM conference on Hypertext

Full text available: pdf  Additional Information: full citation, references, citings, index terms



34 Information Retrieval and Text Mining: A clustering algorithm for asymmetrically related
data with applications to text mining 

K. Krishna, Raghu Krishnapuram 

October 2001 Proceedings of the tenth international conference on Information and 
knowledge management 

Full text available: pdf(593.39 KB)  Additional Information: full citation, abstract, references, citings, index terms

Clustering techniques find a collection of subsets of a data set such that the collection 
satisfies a criterion that is dependent on a relation defined on the data set. The underlying 
relation is traditionally assumed to be symmetric. However, there exist many practical 
scenarios where the underlying relation is asymmetric. One example of an asymmetric 
relation in text analysis is the inclusion relation, i.e., the inclusion of the meaning of a block 
of text in the meaning of another block. In th ... 

35 An evaluation of phrasal and clustered representations on a text categorization task 
David D. Lewis 

June 1992 Proceedings of the 15th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(1.22 MB)  Additional Information: full citation, abstract, references, citings, index terms

Syntactic phrase indexing and term clustering have been widely explored as text 
representation techniques for text retrieval. In this paper we study the properties of phrasal 
and clustered indexing languages on a text categorization task, enabling us to study their 
properties in isolation from query interpretation issues. We show that optimal effectiveness 
occurs when using only a small proportion of the indexing terms available, and that 
effectiveness peaks at a higher feature set size and ... 

36 The logical record access approach to database design

Toby J. Teorey, James P. Fry 

June 1980 ACM Computing Surveys (CSUR), Volume 12 Issue 2

Full text available: pdf(2.61 MB)  Additional Information: full citation, references, citings, index terms

37 Web mining, tools, and performance evaluation: Concept extraction and association
from cancer literature 

Yueyu Fu, Travis Bauer, Javed Mostafa, Mathew Palakal, Snehasis Mukhopadhyay 
November 2002 Proceedings of the 4th international workshop on Web information and 
data management 

Full text available: pdf(272.50 KB)  Additional Information: full citation, abstract, references, citings, index terms

There is a large and growing body of web accessible biomedical literature. As this body of 
electronic literature grows, so does the possibility that document analysis techniques can be 
used to automatically extract useful biomedical information from them, particularly in the 
discovery of key concepts dealing with genes, proteins, drugs, and diseases and 
associations among these concepts. VCGS (Vocabulary Cluster Generating System) was 
designed to automatically extract and determine associations ... 

Keywords: web data mining, web information extraction 

38 The use of cluster hierarchies in hypertext information retrieval
D. B. Crouch, C. J. Crouch, G. Andreas 

November 1989 Proceedings of the second annual ACM conference on Hypertext 

Full text available: pdf  Additional Information: full citation, abstract, references, citings, index terms

The graph-traversal approach to hypertext information retrieval is a conceptualization of 
hypertext in which the structural aspects of the nodes are emphasized. A user navigates 
through such hypertext systems by evaluating the semantics associated with links between 
nodes as well as the information contained in nodes. [Fris88] In this paper we describe a
hierarchical structure which effectively supports the graphical traversal of a document 
collection in a hypertext system ... 

39 Information retrieval 1: Pruning long documents for distributed information retrieval
Jie Lu, Jamie Callan 

November 2002 Proceedings of the eleventh international conference on Information 
and knowledge management 

Full text available: pdf  Additional Information: full citation, abstract, references, citings, index terms

Query-based sampling is a method of discovering the contents of a text database by 
submitting queries to a search engine and observing the documents returned. In prior 
research sampled documents were used to build resource descriptions for automatic 
database selection, and to build a centralized sample database for query expansion and 
result merging. An unstated assumption was that the associated storage costs were 
acceptable. When sampled documents are long, storage costs can be large. This pape ... 

Keywords: distributed information retrieval, document pruning 



40 The use of phrases and structured queries in information retrieval
W. Bruce Croft, Howard R. Turtle, David D. Lewis 

September 1991 Proceedings of the 14th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Full text available: pdf(1.35 MB)  Additional Information: full citation, references, citings, index terms

ABSTRACT

Query expansion methods have been studied for a long time - with debatable success in many 
instances. In this paper we present a probabilistic query expansion model based on a similarity 
thesaurus which was constructed automatically. A similarity thesaurus reflects domain knowledge 
about the particular collection from which it is constructed. We address the two important issues with 
query expansion: the selection and the weighting of additional search terms. In contrast to earlier 
methods, our queries are expanded by adding those terms that are most similar to the concept of the 
query, rather than selecting terms that are similar to the query terms. Our experiments show that 
this kind of query expansion results in a notable improvement in the retrieval effectiveness when 
measured using both recall-precision and usefulness.

In this paper, an attempt is made to show that conventional database management
system software, in particular those of CODASYL type, can be effectively replaced by 
database machines with good performance. The replacement of CODASYL system software 
involves two main steps: (i) In order to preserve the notions of CODASYL records, sets,
areas, and others, we need a methodology for database transformation so that an existing 
CODASYL database may be transformed into suita ... 



Keywords: CODASYL data model, DBC, Database machines, Database management 
systems, Database transformation, Network data model, Query translation, Relative 
performance. 



Multikey access methods based on term discrimination and signature clustering




J. W. Chang, J. H. Lee, Y. J. Lee 
May 1989 ACM SIGIR Forum, Proceedings of the 12th annual international ACM
SIGIR conference on Research and development in information retrieval,
Volume 23 Issue 1-2

Full text available: pdf(963.83 KB)  Additional Information: full citation, abstract, references, citings, index terms

In order to improve the two-level signature file method designed by Sacks-Davis et al. ...
... means of these terms so that we may achieve good performance on retrieval. Meanwhile ...

Organizing Web search results into clusters facilitates users' quick browsing through
search results. Traditional clustering techniques are inadequate since they don't generate 
clusters with highly readable names. In this paper, we reformalize the clustering problem 
as a salient phrase ranking problem. Given a query and the ranked list of documents 
(typically a list of titles and snippets) returned by a certain Web search engine, our 
method first extracts and ranks salient phrases as candidate c ...

With the proliferation of image data, the need to search and retrieve images efficiently and
accurately from a large image database or a collection of image databases has drastically 
increased. To address such a demand, a unified framework called <i>Markov Model 
Mediators</i> (MMMs) is proposed in this paper to facilitate conceptual database 
clustering and to improve the query processing performance by analyzing the summarized 
knowledge. The unique characteristics of MMMs are that it ... 

Two-dimensional contingency or co-occurrence tables arise frequently in important 
applications such as text, web-log and market-basket data analysis. A basic problem in 
contingency table analysis is co-clustering: simultaneous clustering of the rows and 
columns. A novel theoretical formulation views the contingency table as an empirical joint 
probability distribution of two discrete random variables and poses the co-clustering 
problem as an optimization problem in information theory 

While handovers of voice calls in a wide area mobile environment are well understood, 
handovers of multi-media traffic in a local area mobile environment are still in an early stage 
of investigation. Unlike the public wireless networks, handovers for multi-media Wireless 
LANs (WLANs) have special requirements. In this paper, the problems and challenges 
faced in a multi-media WLAN environment are outlined and a multi-tier wireless cell 

Term-based representations of documents have found widespread use in information 
retrieval. However, one of the main shortcomings of such methods is that they largely 
disregard lexical semantics and, as a consequence, are not sufficiently robust with respect 
to variations in word usage. In this paper we investigate the use of concept-based 
document representations to supplement word- or phrase-based features. The utilized 
concepts are automatically extracted from documents via probabilistic late ... 

We describe a visualization technique that uses brushed, parallel histograms to aid in 
understanding concept drift in multidimensional problem spaces. This technique illustrates 
the relationship between changes in distributions of multiple antecedent feature values 
and the outcome distribution. We can also observe effects on the relative utilization of 
predictive rules. Our parallel histogram technique solves the over-plotting difficulty of 
parallel coordinate graphs and the difficulty of compar ... 

In contrast to other kinds of libraries, software libraries need to be conceptually organized. 
When looking for a component, the main concern of users is the functionality of the 
desired component; implementation details are secondary. Software reuse would be 
enhanced with conceptually organized large libraries of software components. In this 
paper, we present GURU, a tool that allows automatic building of such large software 
libraries from documented software components. We focus here on ... 

A research prototype is presented for semantic indexing and retrieval in Information 
Retrieval. The prototype is motivated by a desire to provide a more efficient and effective 
information retrieval system compared to the current state of the art. An overview of the 
Interspace architecture layers is discussed. An object model supporting semantic 
operations is developed. The model contains a rich set of classes and relationships of the 
data for the semantic indexing module. The basis of our ... 

Text compression is of considerable theoretical and practical interest. It is, for example, 
becoming increasingly important for satisfying the requirements of fitting a large database 
onto a single CD-ROM. Many of the compression techniques discussed in the literature are 
model based. We here propose the notion of a formal grammar as a flexible model of text 
generation that encompasses most of the models offered before as well as, in principle, 
extending the possibility of compression to a ... 

High-dimensional collections of 0-1 data occur in many applications. The attributes in 
such data sets are typically considered to be unordered. However, in many cases there is a 
natural total or partial order ≺ underlying the variables of the data set. Examples of 
variables for which such orders exist include terms in documents, courses in enrollment 
data, and paleontological sites in fossil data collections. The observations in such 
applications are flat, unordered sets; however, the data s ... 

This paper proposes a new class-based method to estimate the strength of association in 
word co-occurrence for the purpose of structural disambiguation. To deal with sparseness 
of data, we use a conceptual dictionary as the source for acquiring upper classes of the 
words related in the co-occurrence, and then use t-scores to determine a pair of classes to 
be employed for calculating the strength of association. We have applied our method to 
determining dependency relations in Japanese and prepos ... 

The requirements for effective search and management of the WWW are stronger than 
ever. Currently Web documents are classified based on their content not taking into 
account the fact that these documents are connected to each other by links. We claim that 
a page's classification is enriched by the detection of its incoming links' semantics. This 
would enable effective browsing and enhance the validity of search results in the WWW 
context. Another aspect that is underaddressed and str ... 

Keywords: Document clustering, Link analysis, Link management, Semantics, Similarity 
measure, World Wide Web 

A novel technique for automatic thesaurus construction is proposed. It is based on the 
complementary use of two tools: (1) a Term Extraction tool that acquires term candidates 
from tagged corpora through a shallow grammar of noun phrases, and (2) a Term 
Clustering tool that groups syntactic variants (insertions). Experiments performed on 
corpora in three technical domains yield clusters of term candidates with precision rates 
between 93% and 98%. 